Image Capture and Identification System and Process

ABSTRACT

A digital image of the object is captured and the object is recognized from a plurality of objects in a database. An information address corresponding to the object is then used to access information and initiate communication pertinent to the object.

This application is a divisional of Ser. No. 14/083,210, filed Nov. 18, 2013, which is a divisional of Ser. No. 13/856,197, filed Apr. 3, 2013 and issued Aug. 5, 2014 as U.S. Pat. No. 8,798,368, which is a divisional of Ser. No. 13/693,983, filed Dec. 4, 2012 and issued Apr. 29, 2014 as U.S. Pat. No. 8,712,193, which is a continuation of Ser. No. 13/069,112, filed Mar. 22, 2011 and issued Dec. 4, 2012 as U.S. Pat. No. 8,326,031, which is a divisional of Ser. No. 13/037,317, filed Feb. 28, 2011 and issued Jul. 17, 2012 as U.S. Pat. No. 8,224,078, which is a divisional of Ser. No. 12/333,630, filed Dec. 12, 2008 and issued Mar. 1, 2011 as U.S. Pat. No. 7,899,243, which is a divisional of Ser. No. 10/492,243, filed May 20, 2004 and issued Jan. 13, 2009 as U.S. Pat. No. 7,477,780, which is a National Phase of PCT/US02/35407, filed Nov. 5, 2002, which is a continuation-in-part of Ser. No. 09/992,942, filed Nov. 5, 2001 and issued Mar. 21, 2006 as U.S. Pat. No. 7,016,532, which claims priority to provisional application No. 60/317,521 filed Sep. 5, 2001 and provisional application No. 60/246,295 filed Nov. 6, 2000. These and all other referenced patents and applications are incorporated herein by reference in their entirety. Where a definition or use of a term in a reference that is incorporated by reference is inconsistent or contrary to the definition of that term provided herein, the definition of that term provided herein is deemed to be controlling.

TECHNICAL FIELD

The invention relates to an identification method and process for objects from digitally captured images thereof that uses image characteristics to identify an object from a plurality of objects in a database.

BACKGROUND ART

There is a need to provide hyperlink functionality in known objects without modification to the objects, through reliably detecting and identifying the objects based only on the appearance of the object, and then locating and supplying information pertinent to the object or initiating communications pertinent to the object by supplying an information address, such as a Uniform Resource Locator (URL), pertinent to the object.

There is a need to determine the position and orientation of known objects based only on imagery of the objects.

The detection, identification, determination of position and orientation, and subsequent information provision and communication must occur without modification or disfigurement of the object, without the need for any marks, symbols, codes, barcodes, or characters on the object, without the need to touch or disturb the object, without the need for special lighting other than that required for normal human vision, without the need for any communication device (radio frequency, infrared, etc.) to be attached to or nearby the object, and without human assistance in the identification process. The objects to be detected and identified may be 3-dimensional objects, 2-dimensional images (e.g., on paper), or 2-dimensional images of 3-dimensional objects, or human beings.

There is a need to provide such identification and hyperlink services to persons using mobile computing devices, such as Personal Digital Assistants (PDAs) and cellular telephones.

There is a need to provide such identification and hyperlink services to machines, such as factory robots and spacecraft.

Examples include:

identifying pictures or other art in a museum, where it is desired to provide additional information about such art objects to museum visitors via mobile wireless devices;

provision of content (information, text, graphics, music, video, etc.), communications, and transaction mechanisms between companies and individuals, via networks (wireless or otherwise) initiated by the individuals “pointing and clicking” with camera-equipped mobile devices on magazine advertisements, posters, billboards, consumer products, music or video disks or tapes, buildings, vehicles, etc.;

establishment of a communications link with a machine, such as a vending machine or information kiosk, by “pointing and clicking” on the machine with a camera-equipped mobile wireless device and then execution of communications or transactions between the mobile wireless device and the machine;

identification of objects or parts in a factory, such as on an assembly line, by capturing an image of the objects or parts, and then providing information pertinent to the identified objects or parts;

identification of a part of a machine, such as an aircraft part, by a technician “pointing and clicking” on the part with a camera-equipped mobile wireless device, and then supplying pertinent content to the technician, such as maintenance instructions or history for the identified part;

identification or screening of individual(s) by a security officer “pointing and clicking” a camera-equipped mobile wireless device at the individual(s) and then receiving identification information pertinent to the individuals after the individuals have been identified by face recognition software;

identification, screening, or validation of documents, such as passports, by a security officer “pointing and clicking” a camera-equipped device at the document and receiving a response from a remote computer;

determination of the position and orientation of an object in space by a spacecraft nearby the object, based on imagery of the object, so that the spacecraft can maneuver relative to the object or execute a rendezvous with the object;

identification of objects from aircraft or spacecraft by capturing imagery of the objects and then identifying the objects via image recognition performed on a local or remote computer;

watching movie previews streamed to a camera-equipped wireless device by “pointing and clicking” with such a device on a movie theatre sign or poster, or on a digital video disc box or videotape box;

listening to audio recording samples streamed to a camera-equipped wireless device by “pointing and clicking” with such a device on a compact disk (CD) box, videotape box, or print media advertisement;

purchasing movie, concert, or sporting event tickets by “pointing and clicking” on a theater, advertisement, or other object with a camera-equipped wireless device;

purchasing an item by “pointing and clicking” on the object with a camera-equipped wireless device and thus initiating a transaction;

interacting with television programming by “pointing and clicking” at the television screen with a camera-equipped device, thus capturing an image of the screen content and having that image sent to a remote computer and identified, thus initiating interaction based on the screen content received (an example is purchasing an item on the television screen by “pointing and clicking” at the screen when the item is on the screen);

interacting with a computer-system based game and with other players of the game by “pointing and clicking” on objects in the physical environment that are considered to be part of the game;

paying a bus fare by “pointing and clicking” with a mobile wireless camera-equipped device, on a fare machine in a bus, and thus establishing a communications link between the device and the fare machine and enabling the fare payment transaction;

establishment of a communication between a mobile wireless camera-equipped device and a computer with an Internet connection by “pointing and clicking” with the device on the computer and thus providing to the mobile device an Internet address at which it can communicate with the computer, thus establishing communications with the computer despite the absence of a local network or any direct communication between the device and the computer;

use of a mobile wireless camera-equipped device as a point-of-sale terminal by, for example, “pointing and clicking” on an item to be purchased, thus identifying the item and initiating a transaction.

DISCLOSURE OF INVENTION

The present invention solves the above stated needs. Once an image is captured digitally, a search of the image determines whether symbolic content is included in the image. If so, the symbol is decoded and communication is opened with the proper database, usually using the Internet, wherein the best match for the symbol is returned. In some instances, a symbol may be detected, but non-ambiguous identification is not possible. In that case, and when a symbolic image cannot be detected, the image is decomposed through identification algorithms where unique characteristics of the image are determined. These characteristics are then used to provide the best match or matches in the database, the “best” determination being assisted by the partial symbolic information, if that is available.

Therefore the present invention provides technology and processes that can accommodate linking objects and images to information via a network such as the Internet, which requires no modification to the linked object. Traditional methods for linking objects to digital information, including applying a barcode, radio or optical transceiver or transmitter, or some other means of identification to the object, or modifying the image or object so as to encode detectable information in it, are not required because the image or object can be identified solely by its visual appearance. The users or devices may even interact with objects by “linking” to them. For example, a user may link to a vending machine by “pointing and clicking” on it. His device would be connected over the Internet to the company that owns the vending machine. The company would in turn establish a connection to the vending machine, and thus the user would have a communication channel established with the vending machine and could interact with it.

The decomposition algorithms of the present invention allow fast and reliable detection and recognition of images and/or objects based on their visual appearance in an image, no matter whether shadows, reflections, partial obscuration, and variations in viewing geometry are present. As stated above, the present invention also can detect, decode, and identify images and objects based on traditional symbols which may appear on the object, such as alphanumeric characters, barcodes, or 2-dimensional matrix codes.

When a particular object is identified, the position and orientation of an object with respect to the user at the time the image was captured can be determined based on the appearance of the object in an image. This can be the location and/or identity of people scanned by multiple cameras in a security system, a passive locator system more accurate than GPS or usable in areas where GPS signals cannot be received, the location of specific vehicles without requiring a transmission from the vehicle, and many other uses.

When the present invention is incorporated into a mobile device, such as a portable telephone, the user of the device can link to images and objects in his or her environment by pointing the device at the object of interest, then “pointing and clicking” to capture an image. Thereafter, the device transmits the image to another computer (“Server”), wherein the image is analyzed and the object or image of interest is detected and recognized. Then the network address of information corresponding to that object is transmitted from the Server back to the mobile device, allowing the mobile device to access information using the network address so that only a portion of the information concerning the object need be stored in the system's database.

Some or all of the image processing, including image/object detection and/or decoding of symbols detected in the image, may be distributed arbitrarily between the mobile (Client) device and the Server. In other words, some processing may be performed in the Client device and some in the Server, without specification of which particular processing is performed in each, or all processing may be performed on one platform or the other, or the platforms may be combined so that there is only one platform. The image processing can be implemented in a parallel computing manner, thus facilitating scaling of the system with respect to database size and input traffic loading.

Therefore, it is an object of the present invention to provide a system and process for identifying digitally captured images without requiring modification to the object.

Another object is to use digital capture devices in ways never contemplated by their manufacturer.

Another object is to allow identification of objects from partial views of the object.

Another object is to provide communication means with operative devices without requiring a public connection therewith.

These and other objects and advantages of the present invention will become apparent to those skilled in the art after considering the following detailed specification, together with the accompanying drawings wherein:

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic block diagram top-level algorithm flowchart;

FIG. 2 is an idealized view of image capture;

FIGS. 3A and 3B are a schematic block diagram of process details of the present invention;

FIG. 4 is a schematic block diagram of a different explanation of the invention;

FIG. 5 is a schematic block diagram similar to FIG. 4 for cellular telephone and personal data assistant (PDA) applications; and

FIG. 6 is a schematic block diagram for spacecraft applications.

BEST MODES FOR CARRYING OUT THE INVENTION

The present invention includes a novel process whereby information such as Internet content is presented to a user, based solely on a remotely acquired image of a physical object. Although coded information can be included in the remotely acquired image, it is not required since no additional information about a physical object, other than its image, needs to be encoded in the linked object. There is no need for any additional code or device, radio, optical or otherwise, to be embedded in or affixed to the object. Image-linked objects can be located and identified within user-acquired imagery solely by means of digital image processing, with the address of pertinent information being returned to the device used to acquire the image and perform the link. This process is robust against digital image noise and corruption (as can result from lossy image compression/decompression), perspective error, rotation, translation, scale differences, illumination variations caused by different lighting sources, and partial obscuration of the target that results from shadowing, reflection or blockage.

Many different variations on machine vision “target location and identification” exist in the current art. However, they all tend to provide optimal solutions for an arbitrarily restricted search space. At the heart of the present invention is a high-speed image matching engine that returns unambiguous matches to target objects contained in a wide variety of potential input images. This unique approach to image matching takes advantage of the fact that at least some portion of the target object will be found in the user-acquired image. The parallel image comparison processes embodied in the present search technique are, when taken together, unique to the process. Further, additional refinement of the process, with the inclusion of more and/or different decomposition-parameterization functions, utilized within the overall structure of the search loops is not restricted. The detailed process is described in the following. FIG. 1 shows the overall processing flow and steps. These steps are described in further detail in the following sections.

For image capture 10, the User 12 (FIG. 2) utilizes a computer, mobile telephone, personal digital assistant, or other similar device 14 equipped with an image sensor (such as a CCD or CMOS digital camera). The User 12 aligns the sensor of the image capture device 14 with the object 16 of interest. The linking process is then initiated by suitable means including: the User 12 pressing a button on the device 14 or sensor; by the software in the device 14 automatically recognizing that an image is to be acquired; by User voice command; or by any other appropriate means. The device 14 captures a digital image 18 of the scene at which it is pointed. This image 18 is represented as three separate 2-D matrices of pixels, corresponding to the raw RGB (Red, Green, Blue) representation of the input image. For the purposes of standardizing the analytical processes in this embodiment, if the device 14 supplies an image in other than RGB format, a transformation to RGB is accomplished. These analyses could be carried out in any standard color format, should the need arise.

If the server 20 is physically separate from the device 14, then user acquired images are transmitted from the device 14 to the Image Processor/Server 20 using a conventional digital network or wireless network means. If the image 18 has been compressed (e.g. via lossy JPEG DCT) in a manner that introduces compression artifacts into the reconstructed image 18, these artifacts may be partially removed by, for example, applying a conventional despeckle filter to the reconstructed image prior to additional processing.

The Image Type Determination 26 is accomplished with a discriminator algorithm which operates on the input image 18 and determines whether the input image contains recognizable symbols, such as barcodes, matrix codes, or alphanumeric characters. If such symbols are found, the image 18 is sent to the Decode Symbol 28 process. Depending on the confidence level with which the discriminator algorithm finds the symbols, the image 18 may also, or alternatively, contain an object of interest and may therefore also, or alternatively, be sent to the Object Image branch of the process flow. For example, if an input image 18 contains both a barcode and an object, depending on the clarity with which the barcode is detected, the image may be analyzed by both the Object Image and Symbolic Image branches, and that branch which has the highest success in identification will be used to identify and link from the object.

The image is analyzed to determine the location, size, and nature of the symbols in the Decode Symbol 28. The symbols are analyzed according to their type, and their content information is extracted. For example, barcodes and alphanumeric characters will result in numerical and/or text information.

For object images, the present invention performs a “decomposition”, in the Input Image Decomposition 34, of a high-resolution input image into several different types of quantifiable salient parameters. This allows for multiple independent convergent search processes of the database to occur in parallel, which greatly improves image match speed and match robustness in the Database Matching 36. The Best Match 38 from either the Decode Symbol 28, or the image Database Matching 36, or both, is then determined. If a specific URL (or other online address) is associated with the image, then a URL Lookup 40 is performed and the Internet address is returned by the URL Return 42.

The overall flow of the Input Image Decomposition process is as follows:

Radiometric Correction
Segmentation
Segment Group Generation
FOR each segment group
    Bounding Box Generation
    Geometric Normalization
    Wavelet Decomposition
    Color Cube Decomposition
    Shape Decomposition
    Low-Resolution Grayscale Image Generation
END FOR

Each of the above steps is explained in further detail below. For Radiometric Correction, the input image typically is transformed to an 8-bit per color plane, RGB representation. The RGB image is radiometrically normalized in all three channels. This normalization is accomplished by linear gain and offset transformations that result in the pixel values within each color channel spanning a full 8-bit dynamic range (256 possible discrete values). An 8-bit dynamic range is adequate but, of course, as optical capture devices produce higher resolution images and computers get faster and memory gets cheaper, higher bit dynamic ranges, such as 16-bit, 32-bit or more may be used.
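By way of non-limiting illustration, the linear gain and offset transformation described above may be sketched in Python as follows. The function names `stretch_channel` and `radiometric_correction`, the list-of-rows image layout, and the treatment of a channel with no contrast are assumptions made for this sketch and do not appear in the specification.

```python
def stretch_channel(channel, out_max=255):
    """Linearly rescale one color plane so its values span 0..out_max."""
    lo = min(min(row) for row in channel)
    hi = max(max(row) for row in channel)
    if hi == lo:
        # Assumption: a flat channel has no contrast to stretch, so copy it.
        return [row[:] for row in channel]
    gain = out_max / (hi - lo)      # linear gain
    offset = -lo * gain             # offset that maps the minimum to 0
    return [[int(round(gain * v + offset)) for v in row] for row in channel]


def radiometric_correction(red, green, blue):
    """Normalize the three color planes independently of one another."""
    return tuple(stretch_channel(c) for c in (red, green, blue))


red = [[50, 100], [150, 200]]
green = [[0, 64], [128, 255]]
blue = [[10, 10], [10, 10]]
r, g, b = radiometric_correction(red, green, blue)
print(r)  # [[0, 85], [170, 255]]
```

In this example the red plane, which originally spans 50 to 200, is stretched to the full 0 to 255 range, while the green plane already spans the full range and is returned unchanged.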

For Segmentation, the radiometrically normalized RGB image is analyzed for “segments,” or regions of similar color, i.e. near equal pixel values for red, green, and blue. These segments are defined by their boundaries, which consist of sets of (x, y) point pairs. A map of segment boundaries is produced, which is maintained separately from the RGB input image and is formatted as an x, y binary image map of the same aspect ratio as the RGB image.

For Segment Group Generation, the segments are grouped into all possible combinations. These groups are known as “segment groups” and represent all possible potential images or objects of interest in the input image. The segment groups are sorted based on the order in which they will be evaluated. Various evaluation order schemes are possible. The particular embodiment explained herein utilizes the following “center-out” scheme: The first segment group comprises only the segment that includes the center of the image. The next segment group comprises the previous segment plus the segment which is the largest (in number of pixels) and which is adjacent to (touching) the previous segment group. Additional segments are added using the segment criteria above until no segments remain. Each step, in which a new segment is added, creates a new and unique segment group.
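The “center-out” ordering described above may be sketched as follows, by way of example only. The sketch assumes that segmentation has already produced a pixel count for each segment and a table of which segments touch one another; the names `center_out_groups`, `sizes`, and `adjacency`, and the rule of breaking ties by segment identifier, are hypothetical.

```python
def center_out_groups(sizes, adjacency, center_id):
    """Return the nested segment groups in the order they would be evaluated.

    sizes:      {segment_id: number of pixels in the segment}
    adjacency:  {segment_id: set of segment_ids that touch it}
    center_id:  the segment that contains the center of the image
    """
    group = [center_id]
    groups = [tuple(group)]
    remaining = set(sizes) - {center_id}
    while remaining:
        members = set(group)
        # Candidates are unused segments that touch any member of the group.
        touching = [s for s in remaining if adjacency.get(s, set()) & members]
        if not touching:
            break  # assumption: unreachable segments are simply left out
        # Pick the largest candidate; ties go to the smallest id (assumption).
        largest = min(touching, key=lambda s: (-sizes[s], s))
        group.append(largest)
        remaining.remove(largest)
        groups.append(tuple(group))
    return groups


sizes = {"A": 400, "B": 900, "C": 250, "D": 600}
adjacency = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A", "D"}, "D": {"B", "C"}}
groups = center_out_groups(sizes, adjacency, "A")
print(groups)
```

With segment A at the image center, the groups grow as (A), (A, B), (A, B, D), (A, B, D, C): B is the largest segment touching A, D is then the largest segment touching the enlarged group, and C is added last.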

For Bounding Box Generation, the elliptical major axis of the segment group under consideration (the major axis of an ellipse just large enough to contain the entire segment group) is computed. Then a rectangle is constructed within the image coordinate system, with long sides parallel to the elliptical major axis, of a size just large enough to completely contain every pixel in the segment group.

For Geometric Normalization, a copy of the input image is modified such that all pixels not included in the segment group under consideration are set to mid-level gray. The result is then resampled and mapped into a “standard aspect” output test image space such that the corners of the bounding box are mapped into the corners of the output test image. The standard aspect is the same size and aspect ratio as the Reference images used to create the database.

For Wavelet Decomposition, a grayscale representation of the full-color image is produced from the geometrically normalized image that resulted from the Geometric Normalization step. The following procedure is used to derive the grayscale representation. Reduce the three color planes into one grayscale image by proportionately adding each R, G, and B pixel of the standard corrected color image using the following formula:

L_(x,y) = 0.34*R_(x,y) + 0.55*G_(x,y) + 0.11*B_(x,y)

then round to nearest integer value. Truncate at 0 and 255, if necessary. The resulting matrix L is a standard grayscale image. This grayscale representation is at the same spatial resolution as the full color image, with an 8-bit dynamic range. A multi-resolution Wavelet Decomposition of the grayscale image is performed, yielding wavelet coefficients for several scale factors. The Wavelet coefficients at various scales are ranked according to their weight within the image.
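A multi-resolution decomposition of the kind described above may be sketched as follows. The specification does not name a particular wavelet; the Haar wavelet is assumed here purely because it is the simplest to show, and ranking by absolute magnitude is likewise an assumed reading of “weight.” The names `haar_step`, `haar_decompose`, and `rank_by_magnitude` are hypothetical, and the image dimensions are assumed to be divisible by two at every level.

```python
def haar_step(image):
    """One 2-D Haar level on an even-sized image: (approximation, details)."""
    approx, details = [], []
    for r in range(0, len(image), 2):
        approx_row = []
        for c in range(0, len(image[0]), 2):
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            approx_row.append((a + b + d + e) / 4.0)
            # Left-right, top-bottom, and diagonal differences for this block.
            details.append(((a - b + d - e) / 4.0,
                            (a + b - d - e) / 4.0,
                            (a - b - d + e) / 4.0))
        approx.append(approx_row)
    return approx, details


def haar_decompose(image, levels):
    """Return (coarsest approximation, detail lists ordered finest first)."""
    pyramid = []
    current = image
    for _ in range(levels):
        current, details = haar_step(current)
        pyramid.append(details)
    return current, pyramid


def rank_by_magnitude(details):
    """Rank the coefficients of one scale by absolute value, largest first."""
    return sorted((abs(v) for triple in details for v in triple), reverse=True)


gray = [[10, 10, 20, 20],
        [10, 10, 20, 20],
        [30, 30, 40, 40],
        [30, 30, 40, 40]]
coarse, pyramid = haar_decompose(gray, levels=2)
print(coarse)                         # [[25.0]]
print(rank_by_magnitude(pyramid[1]))  # [10.0, 5.0, 0.0]
```

Because each 2×2 block of the example image is uniform, all finest-scale detail coefficients are zero and the structure of the image appears only at the coarser scale.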

For Color Cube Decomposition, an image segmentation is performed (see “Segmentation” above), on the RGB image that results from Geometric Normalization. Then the RGB image is transformed to a normalized Intensity, In-phase and Quadrature-phase color image (YIQ). The segment map is used to identify the principal color regions of the image, since each segment boundary encloses pixels of similar color. The average Y, I, and Q values of each segment, and their individual component standard deviations, are computed. The following set of parameters result, representing the colors, color variation, and size for each segment:

Y_(avg)=Average Intensity

I_(avg)=Average In-phase

Q_(avg)=Average Quadrature

Y_(sigma)=Intensity standard deviation

I_(sigma)=In-phase standard deviation

Q_(sigma)=Quadrature standard deviation

N_(pixels)=number of pixels in the segment

The parameters comprise a representation of the color intensity and variation in each segment. When taken together for all segments in a segment group, these parameters comprise points (or more accurately, regions, if the standard deviations are taken into account) in a three-dimensional color space and describe the intensity and variation of color in the segment group.
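The seven parameters listed above may be computed for a single segment along the following lines. This is a sketch under stated assumptions: it uses the RGB-to-YIQ conversion provided by the Python standard library `colorsys` module (which expects components scaled to the range 0 to 1), it uses the population standard deviation, and the function name `color_cube_point` and the dictionary keys are hypothetical.

```python
import colorsys
import statistics


def color_cube_point(pixels):
    """Summarize one segment given its pixels as (R, G, B) 8-bit tuples."""
    # colorsys works on floats in 0..1, so scale the 8-bit values first.
    yiq = [colorsys.rgb_to_yiq(r / 255.0, g / 255.0, b / 255.0)
           for r, g, b in pixels]
    y_vals, i_vals, q_vals = zip(*yiq)
    return {
        "Y_avg": statistics.mean(y_vals),
        "I_avg": statistics.mean(i_vals),
        "Q_avg": statistics.mean(q_vals),
        # Population standard deviation is an assumption of this sketch.
        "Y_sigma": statistics.pstdev(y_vals),
        "I_sigma": statistics.pstdev(i_vals),
        "Q_sigma": statistics.pstdev(q_vals),
        "N_pixels": len(pixels),
    }


segment = [(255, 255, 255), (0, 0, 0)]
point = color_cube_point(segment)
print(point["N_pixels"], round(point["Y_avg"], 3), round(point["Y_sigma"], 3))
```

For a segment made of one white and one black pixel, the average intensity and its standard deviation are both approximately 0.5, while the In-phase and Quadrature averages are approximately zero because neither pixel carries any chroma.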

For Shape Decomposition, the map resulting from the segmentation performed in the Color Cube Decomposition step is used and the segment group is evaluated to extract the group outer edge boundary, the total area enclosed by the boundary, and its area centroid. Additionally, the net ellipticity (semi-major axis divided by semi-minor axis of the closest fit ellipse to the group) is determined.
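The shape parameters described above may be derived from the set of pixels in a segment group as sketched below. The specification does not state how the closest fit ellipse is obtained; this sketch assumes an ellipse derived from the second-order central moments of the pixel coordinates, so that the ellipticity is the square root of the ratio of the two eigenvalues of the coordinate covariance. The function name `shape_parameters` and the 4-neighbor definition of the outline are likewise assumptions.

```python
import math


def shape_parameters(pixels):
    """pixels: set of (x, y) integer coordinates in the segment group."""
    area = len(pixels)
    cx = sum(x for x, _ in pixels) / area
    cy = sum(y for _, y in pixels) / area
    # Outline: member pixels with at least one 4-neighbor outside the group.
    outline = {(x, y) for x, y in pixels
               if any(n not in pixels
                      for n in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)))}
    # Second-order central moments (covariance of the pixel coordinates).
    sxx = sum((x - cx) ** 2 for x, _ in pixels) / area
    syy = sum((y - cy) ** 2 for _, y in pixels) / area
    sxy = sum((x - cx) * (y - cy) for x, y in pixels) / area
    # Eigenvalues of [[sxx, sxy], [sxy, syy]] give the squared axis scales.
    half_trace = (sxx + syy) / 2.0
    root = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    major, minor = half_trace + root, half_trace - root
    ellipticity = math.sqrt(major / minor) if minor > 0 else math.inf
    return {"area": area, "centroid": (cx, cy),
            "ellipticity": ellipticity, "outline": outline}


# A 6-wide by 2-tall rectangle of pixels.
rectangle = {(x, y) for x in range(6) for y in range(2)}
shape = shape_parameters(rectangle)
print(shape["area"], shape["centroid"], round(shape["ellipticity"], 3))
```

For the rectangle in the example, the area is 12 pixels, the centroid lies at (2.5, 0.5), and the moment-based ellipticity is approximately 3.416.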

For Low-Resolution Grayscale Image Generation, the full-resolution grayscale representation of the image that was derived in the Wavelet Decomposition step is now subsampled by a factor in both x and y directions. For the example of this embodiment, a 3:1 subsampling is assumed. The subsampled image is produced by weighted averaging of pixels within each 3×3 cell. The result is contrast binned, by reducing the number of discrete values assignable to each pixel based upon substituting a “binned average” value for all pixels that fall within a discrete (TBD) number of brightness bins.
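The 3:1 subsampling and contrast binning described above may be sketched as follows. Several details are assumptions made only for this sketch: the averaging weights within each cell are taken to be uniform, the number of brightness bins (left as TBD in this specification) is set to four, and the center of each bin is used as the stand-in for the “binned average” value. The names `subsample` and `contrast_bin` are hypothetical.

```python
def subsample(gray, factor=3):
    """Average each factor x factor cell (uniform weights assumed)."""
    rows = len(gray) // factor
    cols = len(gray[0]) // factor
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            cell = [gray[r * factor + dr][c * factor + dc]
                    for dr in range(factor) for dc in range(factor)]
            row.append(sum(cell) / len(cell))
        out.append(row)
    return out


def contrast_bin(image, bins=4, levels=256):
    """Replace each pixel by the center value of its brightness bin."""
    width = levels / bins
    out = []
    for row in image:
        new_row = []
        for v in row:
            index = min(int(v // width), bins - 1)
            new_row.append((index + 0.5) * width)
        out.append(new_row)
    return out


gray = [[30, 30, 30, 200, 200, 200]] * 3   # one dark and one bright 3x3 cell
low_res = subsample(gray)
binned = contrast_bin(low_res)
print(low_res)  # [[30.0, 200.0]]
print(binned)   # [[32.0, 224.0]]
```

The two 3×3 cells of the example reduce to one pixel each, and the two pixel values are then replaced by the centers of the lowest and highest of the four assumed brightness bins.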

The above discussion of the particular decomposition methods incorporated into this embodiment is not intended to indicate that more, or alternate, decomposition methods may not also be employed within the context of this invention.

In other words:

FOR each input image segment group
    FOR each database object
        FOR each view of this object
            FOR each segment group in this view of this database object
                Shape Comparison
                Grayscale Comparison
                Wavelet Comparison
                Color Cube Comparison
                Calculate Combined Match Score
            END FOR
        END FOR
    END FOR
END FOR

Each of the above steps is explained in further detail below.

For Each Input Image Segment Group

This loop considers each combination of segment groups in the input image, in the order in which they were sorted in the “Segment Group Generation” step. Each segment group, as it is considered, is a candidate for the object of interest in the image, and it is compared against database objects using various tests.

One favored implementation, of many possible, for the order in which the segment groups are considered within this loop is the “center-out” approach mentioned previously in the “Segment Group Generation” section. This scheme considers segment groups in a sequence that represents the addition of adjacent segments to the group, starting at the center of the image. In this scheme, each new group that is considered comprises the previous group plus one additional adjacent image segment. The new group is compared against the database. If the new group results in a higher database matching score than the previous group, then the new group is retained. If the new group has a lower matching score than the previous group, then it is discarded and the loop starts again. If a particular segment group results in a match score which is extremely high, then this is considered to be an exact match and no further searching is warranted; in this case the current group and matching database group are selected as the match and this loop is exited.

For Each Database Object

This loop considers each object in the database for comparison against the current input segment group.

For Each View of this Object

This loop considers each view of the current database object, for comparison against the current input segment group. The database contains, for each object, multiple views from different viewing angles.

For Each Segment Group in this View of this Database Object

This loop considers each combination of segment groups in the current view of the database object. These segment groups were created in the same manner as the input image segment groups.

Shape Comparison

Inputs:

For the input image and all database images:

I. Segment group outline

II. Segment group area

III. Segment group centroid location

IV. Segment group bounding ellipse ellipticity

Algorithm:

1. Identify those database segment groups with an area approximately equal to that of the input segment group, within TBD limits, and calculate an area matching score for each of these “matches.”

2. Within the set of matches identified in the previous step, identify those database segment groups with an ellipticity approximately equal to that of the input segment group, within TBD limits, and calculate an ellipticity matching score for each of these “matches.”

3. Within the set of matches identified in the previous step, identify those database segment groups with a centroid position approximately equal to that of the input segment group, within TBD limits, and calculate a centroid position matching score for each of these “matches.”

4. Within the set of matches identified in the previous step, identify those database segment groups with an outline shape approximately equal to that of the input segment group, within TBD limits, and calculate an outline matching score for each of these “matches.” This is done by comparing the two outlines and analytically determining the extent to which they match.

Note: this algorithm need not necessarily be performed in the order of Steps 1 to 4. It could alternatively proceed as follows:

FOR each database segment group
    IF the group passes Step 1
        IF the group passes Step 2
            IF the group passes Step 3
                IF the group passes Step 4
                    Successful comparison, save result
                END IF
            END IF
        END IF
    END IF
END FOR
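The nested tests above amount to a cascade in which a database segment group is abandoned as soon as it fails any one step. A minimal sketch of such a cascade follows. The scoring functions, the single shared `limit` standing in for the TBD limits, the assumption that centroids are expressed in normalized (0 to 1) coordinates, and the use of point-set overlap as the outline measure are all hypothetical choices made for illustration.

```python
import math


def ratio_score(a, b):
    """1.0 when a equals b, falling toward 0.0 as the two values diverge."""
    largest = max(abs(a), abs(b))
    return 1.0 if largest == 0 else 1.0 - abs(a - b) / largest


def area_score(q, d):
    return ratio_score(q["area"], d["area"])


def ellipticity_score(q, d):
    return ratio_score(q["ellipticity"], d["ellipticity"])


def centroid_score(q, d):
    # Centroids are assumed to be in normalized (0..1) coordinates.
    (qx, qy), (dx, dy) = q["centroid"], d["centroid"]
    return max(0.0, 1.0 - math.hypot(qx - dx, qy - dy))


def outline_score(q, d):
    # Intersection over union of outline point sets (an assumed measure).
    union = q["outline"] | d["outline"]
    return len(q["outline"] & d["outline"]) / len(union) if union else 1.0


STEPS = [("area", area_score), ("ellipticity", ellipticity_score),
         ("centroid", centroid_score), ("outline", outline_score)]


def shape_comparison(query, database, limit=0.8):
    """Return (name, scores) for each database group that passes every step."""
    results = []
    for name, group in database.items():
        scores = {}
        for step, score_fn in STEPS:
            score = score_fn(query, group)
            if score < limit:
                break  # the group fails this step, so later steps are skipped
            scores[step] = score
        else:
            results.append((name, scores))  # all four steps passed
    return results


square = {(0, 0), (0, 1), (1, 0), (1, 1)}
query = {"area": 100, "ellipticity": 2.0, "centroid": (0.5, 0.5), "outline": square}
database = {
    "mug": {"area": 95, "ellipticity": 2.1, "centroid": (0.52, 0.5), "outline": square},
    "poster": {"area": 300, "ellipticity": 2.0, "centroid": (0.5, 0.5), "outline": square},
}
results = shape_comparison(query, database)
print([name for name, _ in results])  # ['mug']
```

In the example, the hypothetical “poster” entry is rejected at the area test and is never evaluated for ellipticity, centroid position, or outline.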

Grayscale Comparison

Inputs:

For the input image and all database images:

I. Low-resolution, normalized, contrast-binned, grayscale image of pixels within segment group bounding box, with pixels outside of the segment group set to a standard background color.

Algorithm:

Given a series of concentric rectangular “tiers” of pixels within the low-resolution images, compare the input image pixel values to those of all database images. Calculate a matching score for each comparison and identify those database images with matching scores within TBD limits, as follows:

FOR each database image
    FOR each tier, starting with the innermost and progressing to the outermost
        Compare the pixel values between the input and database image
        Calculate an aggregate matching score
        IF matching score is greater than some TBD limit (i.e., close match)
            Successful comparison, save result
        END IF
    END FOR
END FOR
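The tiered comparison above may be sketched as follows, following the loop structure literally. The score used here (one minus the mean absolute pixel difference, accumulated over all tiers visited so far) and the value of `limit` are assumptions, since the specification leaves the limits as TBD; the names `tiers_inner_to_outer` and `grayscale_comparison` are hypothetical.

```python
def tiers_inner_to_outer(rows, cols):
    """Group pixel coordinates into concentric rectangular tiers."""
    deepest = (min(rows, cols) - 1) // 2
    tiers = [[] for _ in range(deepest + 1)]
    for r in range(rows):
        for c in range(cols):
            # Depth is the distance to the nearest image edge; 0 is outermost.
            depth = min(r, c, rows - 1 - r, cols - 1 - c)
            tiers[depth].append((r, c))
    return tiers[::-1]  # innermost tier first


def grayscale_comparison(query, database, limit=0.9, levels=256):
    """Return {name: score} for database images that exceed the limit."""
    tiers = tiers_inner_to_outer(len(query), len(query[0]))
    matches = {}
    for name, image in database.items():
        total_diff, count = 0, 0
        for tier in tiers:
            for r, c in tier:
                total_diff += abs(query[r][c] - image[r][c])
                count += 1
            # Aggregate score over every tier compared so far (assumed form).
            score = 1.0 - total_diff / (count * (levels - 1))
            if score > limit:
                matches[name] = score  # successful comparison, save result
    return matches


query = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
database = {
    "same": [row[:] for row in query],
    "inverted": [[255 - v for v in row] for row in query],
}
matches = grayscale_comparison(query, database)
print(matches)  # {'same': 1.0}
```

Working outward from the center means that the central pixels of the candidate object influence the aggregate score from the first tier onward.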

Wavelet Comparison

Inputs:

For the input image and all database images:

I. Wavelet coefficients from high-resolution grayscale image within segment group bounding box.

Algorithm:

Successively compare the wavelet coefficients of the input segment group image and each database segment group image, starting with the lowest-order coefficients and progressing to the highest-order coefficients. For each comparison, compute a matching score. For each new coefficient, only consider those database groups that had matching scores, at the previous (next lower order) coefficient, within TBD limits.

FOR each database image
    IF input image C₀ equals database image C₀ within TBD limit
        IF input image C₁ equals database image C₁ within TBD limit
            IF input image C_(N) equals database image C_(N) within TBD limit
                Close match, save result and match score
            END IF
        END IF
    END IF
END FOR

Notes:

I. “C_(i)” are the wavelet coefficients, with C₀ being the lowest order coefficient and C_(N) being the highest.

II. When the coefficients are compared, they are actually compared on a statistical (e.g. Gaussian) basis, rather than an arithmetic difference.

III. Data indexing techniques are used to allow direct fast access to database images according to their C_(i) values. This allows the algorithm to successively narrow the portions of the database of interest as it proceeds from the lowest order terms to the highest.
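The successive narrowing described above may be sketched as follows. The Gaussian agreement measure, the single `sigma` shared by all coefficients, the value of `limit`, and the use of the mean of the per-coefficient scores as the final match score are all assumptions made for this sketch. The data indexing mentioned in Note III is not modeled; the candidates are simply held in a dictionary.

```python
import math


def gaussian_score(a, b, sigma):
    """Agreement in the range (0, 1]; equals 1.0 when a and b coincide."""
    z = (a - b) / sigma
    return math.exp(-0.5 * z * z)


def wavelet_comparison(query, database, sigma=1.0, limit=0.6):
    """query: [C0, C1, ..., CN]; database: {name: [C0, C1, ..., CN]}."""
    candidates = {name: [] for name in database}
    for order, q in enumerate(query):       # lowest order coefficient first
        survivors = {}
        for name, scores in candidates.items():
            s = gaussian_score(q, database[name][order], sigma)
            if s >= limit:
                survivors[name] = scores + [s]
        candidates = survivors              # only survivors see the next order
        if not candidates:
            break
    # Mean of the per-coefficient scores stands in for the match score.
    return {name: sum(s) / len(s) for name, s in candidates.items()}


query = [4.0, 1.0, -0.5]
database = {
    "a": [4.1, 0.9, -0.4],   # close at every order
    "b": [4.0, 3.0, -0.5],   # agrees at C0, diverges at C1
    "c": [9.0, 1.0, -0.5],   # diverges at C0
}
result = wavelet_comparison(query, database)
print(sorted(result))  # ['a']
```

Entry “c” is eliminated after the first coefficient and entry “b” after the second, so only entry “a” is compared at the highest order.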

Color Cube Comparison

Inputs:

[Y_(avg), I_(avg), Q_(avg), Y_(sigma), I_(sigma), Q_(sigma), N_(pixels)] data sets (“Color Cube Points”) for each segment in:

I. The input segment group image

II. Each database segment group image

Algorithm:

FOR each database image
    FOR each segment group in the database image
        FOR each Color Cube Point in database segment group, in order of descending N_(pixels) value
            IF Gaussian match between input (Y,I,Q) and database (Y,I,Q)
                I. Calculate match score for this segment
                II. Accumulate segment match score into aggregate match score for segment group
                III. IF aggregate matching score is greater than some TBD limit (i.e., close match)
                    Successful comparison, save result
            END IF
        END FOR
    END FOR
END FOR

Notes:

I. The size of the Gaussian envelope about any Y, I, Q point is determined by RSS of standard deviations of Y, I, and Q for that point.
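A minimal sketch of the Color Cube Point comparison follows. The envelope size is the root sum of squares (RSS) of the Y, I, and Q standard deviations, as stated in the note above; the exponential score, the weighting by N_(pixels), and the names `gaussian_score` and `segment_group_score` are assumptions made for illustration.

```python
import math


def gaussian_score(p, q):
    """Score in [0, 1] for input point p against database point q.

    Each point is a dict with 'mean' (Y, I, Q), 'sigma' (Y, I, Q), 'n' pixels.
    """
    envelope = math.sqrt(sum(s * s for s in q["sigma"]))  # RSS of sigmas
    dist = math.dist(p["mean"], q["mean"])
    if envelope == 0:
        return 1.0 if dist == 0 else 0.0
    return math.exp(-0.5 * (dist / envelope) ** 2)


def segment_group_score(input_points, db_points):
    """Accumulate per-segment scores, largest database segments first."""
    total_n = sum(q["n"] for q in db_points)
    score = 0.0
    for q in sorted(db_points, key=lambda q: q["n"], reverse=True):
        best = max(gaussian_score(p, q) for p in input_points)
        score += best * q["n"] / total_n  # assumed pixel-count weighting
    return score
```

Visiting points in descending N_(pixels) order means the largest segments contribute first, so a caller could stop as soon as the aggregate score passes its limit.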

Calculate Combined Match Score

The four Object Image comparisons (Shape Comparison, Grayscale Comparison, Wavelet Comparison, Color Cube Comparison) each return a normalized matching score. These are independent assessments of the match of salient features of the input image to database images. To minimize the effect of uncertainties in any single comparison process, and thus to minimize the likelihood of returning a false match, the following root sum of squares relationship is used to combine the results of the individual comparisons into a combined match score for an image:

CurrentMatch = SQRT(W_(OC) M_(OC)² + W_(CCC) M_(CCC)² + W_(WC) M_(WC)² + W_(SGC) M_(SGC)²)

where Ws are TBD parameter weighting coefficients and Ms are the individual match scores of the four different comparisons.
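The root sum of squares combination can be written directly in code. In this sketch the dictionary keys reuse the subscripts from the formula above; the weight values in the usage note are placeholders, since the text leaves the weighting coefficients TBD.

```python
import math


def combined_match(scores, weights):
    """CurrentMatch = SQRT(sum of W_k * M_k^2) over the four comparisons."""
    return math.sqrt(sum(weights[k] * scores[k] ** 2 for k in scores))
```

With equal placeholder weights of 0.25 and four perfect scores of 1.0, the combined score is 1.0; lowering any single score lowers the result, but no single comparison dominates it.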

The unique database search methodology and subsequent object match scoring criteria are novel aspects of the present invention that deserve special attention. Each decomposition of the Reference image and Input image regions represents an independent characterization of salient characteristics of the image. The Wavelet Decomposition, Color Cube Decomposition, Shape Decomposition, and evaluation of a sub-sampled low-resolution Grayscale representation of an input image all produce sets of parameters that describe the image in independent ways. Once all four of these processes are completed on the image to be tested, the parameters provided by each characterization are compared to the results of identical characterizations of the Reference images, which have been previously calculated and stored in the database. These comparisons, or searches, are carried out in parallel. The result of each search is a numerical score that is a weighted measure of the number of salient characteristics that “match” (i.e. that are statistically equivalent). Near equivalencies are also noted, and are counted in the cumulative score, but at a significantly reduced weighting.

One novel aspect of the database search methodology in the present invention is that not only are these independent searches carried out in parallel, but also, all but the low-resolution grayscale compares are “convergent.” By convergent, it is meant that input image parameters are searched sequentially over increasingly smaller subsets of the entire database. The parameter carrying greatest weight from the input image is compared first to find statistical matches and near-matches in all database records. A normalized interim score (e.g., scaled value from zero to one, where one is perfect match and zero is no match) is computed, based on the results of this comparison. The next heaviest weighted parameter from the input image characterization is then searched on only those database records having initial interim scores above a minimum acceptable threshold value. This results in an incremental score that is incorporated into the interim score in a cumulative fashion. Then, subsequent compares of increasingly lesser-weighted parameters are assessed only on those database records that have cumulative interim scores above the same minimum acceptable threshold value in the previous accumulated set of tests.
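The convergent search can be sketched as a loop over parameters in order of decreasing weight, discarding records whose normalized interim score falls to or below the threshold. The data layout, the caller-supplied `match_fn` (expected to return 1.0 for a match, a reduced value for a near-match, and 0.0 otherwise), and the normalization by accumulated weight are assumptions made for illustration.

```python
def convergent_search(input_params, database, weights, match_fn, threshold):
    """Return (score, record_id) pairs, best first, for surviving records.

    database maps record_id -> {parameter_name: value}; weights maps
    parameter_name -> weight. Heaviest-weighted parameters are tested first.
    """
    order = sorted(weights, key=weights.get, reverse=True)
    interim = {rec_id: 0.0 for rec_id in database}
    used_weight = 0.0
    for name in order:
        used_weight += weights[name]
        survivors = {}
        for rec_id, acc in interim.items():
            acc += weights[name] * match_fn(input_params[name], database[rec_id][name])
            # Keep the record only if its normalized interim score stays
            # above the minimum acceptable threshold.
            if acc / used_weight > threshold:
                survivors[rec_id] = acc
        interim = survivors
    total = used_weight or 1.0
    return sorted(((acc / total, rec_id) for rec_id, acc in interim.items()), reverse=True)
```

Each pass touches only the records that survived the previous one, which is what makes later, lesser-weighted comparisons progressively cheaper.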

This search technique results in quick completion of robust matches, and establishes limits on the domain of database elements that will be compared in a subsequent combined match calculation and therefore speeds up the process. The convergent nature of the search in these comparisons yields a ranked subset of the entire database.

The result of each of these database comparisons is a ranking of the match quality of each image, as a function of decomposition search technique. Only those images with final cumulative scores above the acceptable match threshold will be assessed in the next step, a Combined Match Score evaluation.

Four database comparison processes, Shape Comparison, Grayscale Comparison, Wavelet Comparison, and Color Cube Comparison, are performed. These processes may occur sequentially, but generally are preferably performed in parallel on a parallel computing platform. Each comparison technique searches the entire image database and returns those images that provide the best matches, for the particular algorithm, along with the matching scores for these images. These comparison algorithms are performed on segment groups, with each input image segment group being compared to each segment group for each database image.

FIGS. 3A and 3B show the process flow within the Database Matching operation. The algorithm is presented here as containing four nested loops with four parallel processes inside the innermost loop. This structure is for presentation and explanation only. The actual implementation, although performing the same operations at the innermost layer, can have a different structure in order to achieve the maximum benefit from processing speed enhancement techniques such as parallel computing and data indexing techniques. It is also important to note that the loop structures can be implemented independently for each inner comparison, rather than the shared approach shown in FIGS. 3A and 3B.

Preferably, parallel processing is used to divide tasks between multiple CPUs (Central Processing Units) and/or computers. The overall algorithm may be divided in several ways, such as:

Sharing the Outer Loop: In this technique, all CPUs run the entire algorithm, including the outer loop, but one CPU runs the loop for the first N cycles, another CPU for the second N cycles, all simultaneously.

Sharing the Comparisons: In this technique, one CPU performs the loop functions. When the comparisons are performed, they are each passed to a separate CPU to be performed in parallel.

Sharing the Database: This technique entails splitting database searches between CPUs, so that each CPU is responsible for searching one section of the database, and the sections are searched in parallel by multiple CPUs. This is, in essence, a form of the “Sharing the Outer Loop” technique described above.

Actual implementations can be some combination of the above techniques that optimizes the process on the available hardware.
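As an illustration of the “Sharing the Database” technique, the following sketch splits the records into sections and searches them concurrently. Worker threads from Python's standard `concurrent.futures` module stand in for the separate CPUs here; the function names, the scoring callback, and the `limit` parameter are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def search_section(section, query, score_fn, limit):
    """Score every record in one section; keep those at or above the limit."""
    hits = []
    for rec_id, rec in section:
        score = score_fn(query, rec)
        if score >= limit:
            hits.append((score, rec_id))
    return hits


def partitioned_search(database, query, score_fn, limit, workers=4):
    """Search disjoint sections of the database in parallel, then merge."""
    items = list(database.items())
    sections = [items[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(
            lambda s: search_section(s, query, score_fn, limit), sections))
    return sorted((hit for part in parts for hit in part), reverse=True)
```

Because the sections are disjoint, the workers share no mutable state and their results can simply be merged and ranked at the end. For CPU-bound scoring in Python, a process pool would typically replace the thread pool.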

Another technique employed to maximize speed is data indexing. This technique involves using a priori knowledge of where data resides to only search in those parts of the database that contain potential matches. Various forms of indexing may be used, such as hash tables, data compartmentalization (i.e., data within certain value ranges are stored in certain locations), data sorting, and database table indexing. For example, in the Shape Comparison algorithm (see below), if a database is to be searched for an entry with an Area with a value of A, the algorithm would know which database entries or data areas have this approximate value and would not need to search the entire database.
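The Area example can be illustrated with a sorted index and binary search, one of several indexing forms the text lists. The class name `AreaIndex`, its methods, and the tolerance parameter are hypothetical; only the standard `bisect` module is used.

```python
import bisect


class AreaIndex:
    """Sorted index from Area value to record id (data sorting approach)."""

    def __init__(self, entries):
        # entries maps record_id -> area
        self._sorted = sorted((area, rec_id) for rec_id, area in entries.items())
        self._areas = [area for area, _ in self._sorted]

    def near(self, area, tol):
        """Return ids of records whose area lies within area +/- tol."""
        lo = bisect.bisect_left(self._areas, area - tol)
        hi = bisect.bisect_right(self._areas, area + tol)
        return [rec_id for _, rec_id in self._sorted[lo:hi]]
```

A lookup such as `AreaIndex({"x": 100, "y": 105, "z": 300}).near(102, 5)` inspects only the slice of entries with approximately the requested value, rather than scanning every record.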

Another technique employed is as follows. FIG. 4 shows a simplified configuration of the invention. Boxes with solid lines represent processes, software, physical objects, or devices. Boxes with dashed lines represent information. The process begins with an object of interest: the target object 100. In the case of consumer applications, the target object 100 could be, for example, a beverage can, a music CD box, a DVD video box, a magazine advertisement, a poster, a theatre, a store, a building, a car, or any other object that a user is interested in or wishes to interact with. In security applications the target object 100 could be, for example, a person, passport, or driver's license, etc. In industrial applications the target object 100 could be, for example, a part in a machine, a part on an assembly line, a box in a warehouse, or a spacecraft in orbit, etc.

The terminal 102 is a computing device that has an “image” capture device such as a digital camera 103, a video camera, or any other device that can convert a physical object into a digital representation of the object. The imagery can be a single image, a series of images, or a continuous video stream. For simplicity of explanation this document describes the digital imagery generally in terms of a single image; however, the invention and this system can use all of the imagery types described above.

After the camera 103 captures the digital imagery of the target object 100, image preprocessing 104 software converts the digital imagery into image data 105 for transmission to and analysis by an identification server 106. Typically a network connection is provided capable of providing communications with the identification server 106. Image data 105 is data extracted or converted from the original imagery of the target object 100 and has information content appropriate for identification of the target object 100 by the object recognition 107, which may be software or hardware. Image data 105 can take many forms, depending on the particular embodiment of the invention. Examples of image data 105 are:

Compressed (e.g., JPEG2000) form of the raw imagery from camera 103;

Key image information, such as spectral and/or spatial frequency components (e.g. wavelet components) of the raw imagery from camera 103; and

MPEG video stream created from the raw imagery from camera 103.

The particular form of the image data 105 and the particular operations performed in image preprocessing 104 depend on:

Algorithm and software used in object recognition 107;

Processing power of terminal 102;

Network connection speed between terminal 102 and identification server 106;

Application of the System; and

Required system response time.

In general, there is a tradeoff between the network connection speed (between terminal 102 and identification server 106) and the processing power of terminal 102. The results of all of the above tradeoffs will define the nature of image preprocessing 104 and image data 105 for a specific embodiment. For example, image preprocessing 104 could be image compression and image data 105 could be compressed imagery, or image preprocessing 104 could be wavelet analysis and image data 105 could be wavelet coefficients.

The image data 105 is sent from the terminal 102 to the identification server 106. The identification server 106 receives the image data 105 and passes it to the object recognition 107.

The identification server 106 is a set of functions that usually will exist on a computing platform separate from the terminal 102, but could exist on the same computing platform. If the identification server 106 exists on a separate computing device, such as a computer in a data center, then the transmission of the image data 105 to the identification server 106 is accomplished via a network or combination of networks, such as a cellular telephone network, wireless Internet, Internet, and wire line network. If the identification server 106 exists on the same computing device as the terminal 102 then the transmission consists simply of a transfer of data from one software component or process to another.

Placing the identification server 106 on a computing platform separate from the terminal 102 enables the use of powerful computing resources for the object recognition 107 and database 108 functions, thus providing the power of these computing resources to the terminal 102 via network connection. For example, an embodiment that identifies objects out of a database of millions of known objects would be facilitated by the large storage, memory capacity, and processing power available in a data center; it may not be feasible to have such computing power and storage in a mobile device. Whether the terminal 102 and the identification server 106 are on the same computing platform or separate ones is an architectural decision that depends on system response time, number of database records, image recognition algorithm computing power and storage available in terminal 102, etc., and this decision must be made for each embodiment of the invention. Based on current technology, in most embodiments these functions will be on separate computing platforms.

The overall function of the identification server 106 is to determine and provide the target object information 109 corresponding to the target object 100, based on the image data 105.

The object recognition 107 and the database 108 function together to:

1. Detect, recognize, and decode symbols, such as barcodes or text, in the image.

2. Recognize the object (the target object 100) in the image.

3. Provide the target object information 109 that corresponds to the target object 100. The target object information 109 usually (depending on the embodiment) includes an information address corresponding to the target object 100.

The object recognition 107 detects and decodes symbols, such as barcodes or text, in the input image. This is accomplished via algorithms, software, and/or hardware components suited for this task. Such components are commercially available (the HALCON software package from MVTec is an example). The object recognition 107 also detects and recognizes images of the target object 100 or portions thereof. This is accomplished by analyzing the image data 105 and comparing the results to other data, representing images of a plurality of known objects, stored in the database 108, and recognizing the target object 100 if a representation of target object 100 is stored in the database 108.

In some embodiments the terminal 102 includes software, such as a web browser (the browser 110), that receives an information address, connects to that information address via a network or networks, such as the Internet, and exchanges information with another computing device at that information address. In consumer applications the terminal 102 may be a portable cellular telephone or Personal Digital Assistant equipped with a camera 103 and wireless Internet connection. In security and industrial applications the terminal 102 may be a similar portable hand-held device or may be fixed in location and/or orientation, and may have either a wireless or wire line network connection.

Recognition of the target object is performed by an object recognition technique, of which many are available commercially and in the prior art. Such object recognition techniques usually consist of comparing a new input image to a plurality of known images and detecting correspondences between the new input image and one or more of the known images. The known images are views of known objects from a plurality of viewing angles and thus allow recognition of 2-dimensional and 3-dimensional objects in arbitrary orientations relative to the camera 103. Other object recognition techniques also exist and include methods that store 3-dimensional models (rather than 2-dimensional images) of objects in a database and correlate input images with these models.

FIG. 4 shows the object recognition 107 and the database 108 as separate functions for simplicity. However, in many embodiments the object recognition 107 and the database 108 are so closely interdependent that they may be considered a single process.

There are various options for the object recognition technique and the particular processes performed within the object recognition 107 and the database 108 depend on this choice. The choice depends on the nature, requirements, and architecture of the particular embodiment of the invention. However, most embodiments will usually share most of the following desired attributes of the image recognition technique:

Capable of recognizing both 2-dimensional (i.e., flat) and 3-dimensional objects;

Capable of discriminating the target object 100 from any foreground or background objects or image information, i.e., be robust with respect to changes in background;

Fast;

Autonomous (no human assistance required in the recognition process);

Scalable; able to identify objects from a large database of known objects with short response time; and

Robust with respect to:

Affine transformations (rotation, translation, scaling);

Non-affine transformations (stretching, bending, breaking);

Occlusions (of the target object 100);

Shadows (on the target object 100);

Reflections (on the target object 100);

Variations in light color temperature;

Image noise;

Capable of determining position and orientation of the target object 100 in the original imagery; and

Capable of recognizing individual human faces from a database containing data representing a large plurality of human faces.

Not all of these attributes apply to all embodiments. For example, consumer linking embodiments generally do not require determination of position and orientation of the target object 100, while a spacecraft target position and orientation determination system generally would not be required to identify human faces or a large number of different objects.

It is usually desirable that the database 108 be scalable to enable identification of the target object 100 from a very large plurality (for example, millions) of known objects in the database 108. The algorithms, software, and computing hardware must be designed to function together to quickly perform such a search. An example software technique for performing such searching quickly is to use a metric distance comparison technique for comparing the image data 105 to data stored in the database 108, along with database clustering and multiresolution distance comparisons. This technique is described in “Fast Exhaustive Multi-Resolution Search Algorithm Based on Clustering for Efficient Image Retrieval,” by Song, Kim, and Ra, 2000.

In addition to such software techniques, a parallel processing computing architecture may be employed to achieve fast searching of large databases. Parallel processing is particularly important in cases where a non-metric distance is used in object recognition 107, because techniques such as database clustering and multiresolution search may not be possible and thus the complete database must be searched by partitioning the database across multiple CPUs.

As described above, the object recognition 107 can also detect identifying marks on the target object 100. For example, the target object 100 may include an identifying number or a barcode. This information can be decoded and used to identify or help identify the target object 100 in the database 108. This information also can be passed on as part of the target object information 109. If the information is included as part of the target object information 109 then it can be used by the terminal 102 or content server 111 to identify the specific target object 100, out of many such objects that have similar appearance and differ only in the identifying marks. This technique is useful, for example, in cases where the target object 100 is an active device with a network connection (such as a vending machine) and the content server establishes communication with the target object 100. A combination with a Global Positioning System can also be used to identify like objects by their location.

The object recognition 107 may be implemented in hardware, software, or a combination of both. Examples of each category are presented below.

Hardware object recognition implementations include optical correlators, optimized computing platforms, and custom hardware.

Optical correlators detect objects in images very rapidly by, in effect, performing image correlation calculations with light. Examples of optical correlators are:

Litton Miniaturized Ruggedized Optical Correlator, from Northrop Grumman Corp;

Hybrid Digital/Optical Correlator, from the School of Engineering and Information Technology, University of Sussex, UK; and

OC-VGA3000 and OC-VGA6000 Optical Correlators from INO, Quebec, Canada.

Optimized computing platforms are hardware computing systems, usually on a single board, that are optimized to perform image processing and recognition algorithms very quickly. These platforms must be programmed with the object recognition algorithm of choice. Examples of optimized computing platforms are:

VIP/Balboa™ Image Processing Board, from Irvine Sensors Corp.; and

3DANN™-R Processing System, from Irvine Sensors Corp.

Image recognition calculations can also be implemented directly in custom hardware in forms such as Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), and Digital Signal Processors (DSPs).

There are many object and image recognition software applications available commercially and many algorithms published in the literature. Examples of commercially available image/object recognition software packages include:

Object recognition system, from Sandia National Laboratories;

Object recognition perception modules, from Evolution Robotics;

ImageFinder, from Attrasoft;

ImageWare, from Roz Software Systems; and

ID-2000, from Imagis Technologies.

Some of the above recognition systems include 3-dimensional object recognition capability while others perform 2-dimensional image recognition. The latter type are used to perform 3-dimensional object recognition by comparing input images to a plurality of 2-dimensional views of objects from a plurality of viewing angles.

Examples of object recognition algorithms in the literature and intended for implementation in software are:

Distortion Invariant Object Recognition in the Dynamic Link Architecture, Lades et al, 1993;

SEEMORE: Combining Color, Shape, and Texture Histogramming in a Neurally Inspired Approach to Visual Object Recognition, Mel, 1996;

Probabilistic Affine Invariants for Recognition, Leung et al, 1998;

Software Library for Appearance Matching (SLAM), Nene et al, 1994;

Probabilistic Models of Appearance for 3-D Object Recognition, Pope & Lowe, 2000;

Matching 3D Models with Shape Distributions, Osada et al, 2001;

Finding Pictures of Objects in Large Collections of Images, Forsyth et al, 1996;

The Earth Mover's Distance under Transformation Sets, Cohen & Guibas, 1999;

Object Recognition from Local Scale-Invariant Features, Lowe, 1999; and

Fast Object Recognition in Noisy Images Using Simulated Annealing, Betke & Makris, 1994.

Part of the current invention is the following object recognition algorithm specifically designed to be used as the object recognition 107 and, to some extent, the database 108. This algorithm is robust with respect to occlusions, reflections, shadows, background/foreground clutter, object deformation and breaking, and is scalable to large databases. The task of the algorithm is to find an object or portion thereof in an input image, given a database of multiple objects with multiple views (from different angles) of each object.

This algorithm uses the concept of a Local Image Descriptor (LID) to summarize the information in a local region of an image. A LID is a circular subset, or “cutout,” of a portion of an image. There are various formulations for LIDs; two examples are:

LID Formulation 1

The area within the LID is divided into range and angle bins. The average color in each [range, angle] bin is calculated from the pixel values therein.

LID Formulation 2

The area within the LID is divided into range bins. The color histogram values within each range bin are calculated from the pixel values therein. For each range bin, a measure of the variation of color with angle is calculated as, for example, the sum of the changes in average color between adjacent small angular slices of a range bin.
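LID Formulation 1 can be sketched as follows. To keep the example short it assumes a single intensity channel rather than full color, equal-width range bins, and an image stored as a list of rows; the function names, bin counts, and the choice of 0.0 for empty bins are hypothetical. The L1 distance shown is one of the comparison techniques the text names.

```python
import math


def lid_descriptor(image, cx, cy, radius, n_range=3, n_angle=4):
    """Average pixel value in each [range, angle] bin of a circular cutout."""
    sums = [[0.0] * n_angle for _ in range(n_range)]
    counts = [[0] * n_angle for _ in range(n_range)]
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            dx, dy = x - cx, y - cy
            r = math.hypot(dx, dy)
            if r >= radius:
                continue  # pixel lies outside the circular LID
            ri = int(r / radius * n_range)
            angle = math.atan2(dy, dx) % (2 * math.pi)
            ai = int(angle / (2 * math.pi) * n_angle) % n_angle
            sums[ri][ai] += value
            counts[ri][ai] += 1
    return [
        sums[i][j] / counts[i][j] if counts[i][j] else 0.0
        for i in range(n_range)
        for j in range(n_angle)
    ]


def l1_distance(a, b):
    """Sum of absolute differences between two descriptors."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

Two descriptors taken from the same region give a distance of zero, and the distance grows as the underlying image content diverges.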

A LID in the input image is compared to a LID in a database image by a comparison technique such as the L1 Distance, L2 Distance, Unfolded Distance, Earth Mover Distance, or cross-correlation. Small distances indicate a good match between the portions of the images underlying the LIDs. By iteratively changing the position and size of the LIDs in the input and database images the algorithm converges on the best match between circular regions in the 2 images.

Limiting the comparisons to subsets (circular LIDs) of the images enables the algorithm to discriminate an object from the background. Only LIDs that fall on the object, as opposed to the background, yield good matches with database images. This technique also enables matching of partially occluded objects; a LID that falls on the visible part of an occluded object will match to a LID in the corresponding location in the database image of the object.

The iteration technique used to find the best match is simulated annealing, although genetic search, steepest descent, or other similar techniques appropriate for multivariable optimization can also be used individually or in combination with simulated annealing. Simulated annealing is modeled after the concept of a molten substance cooling and solidifying into a solid. The algorithm starts at a given temperature and then the temperature is gradually reduced with time. At each time step, the values of the search variables are perturbed from their previous values to create a new “child” generation of LIDs. The perturbations are calculated statistically and their magnitudes are functions of the temperature. As the temperature decreases the perturbations decrease in size. The child LIDs, in the input and database images, are then compared. If the match is better than that obtained with the previous “parent” generation, then a statistical decision is made regarding whether to accept or reject the child LIDs as the current best match. This decision is a function of both the match distance and the temperature. The probability of child acceptance increases with temperature and decreases with match distance. Thus, good matches (small match distance) are more likely to be accepted but poor matches can also be accepted occasionally. The latter case is more likely to occur early in the process when the temperature is high. Statistical acceptance of poor matches is included to allow the algorithm to “jump” out of local minima.
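The annealing loop can be sketched generically as follows. The text does not give the acceptance rule or cooling schedule, so this sketch assumes the common Metropolis criterion, a geometric cooling schedule, and Gaussian perturbations scaled by the temperature; `distance_fn` stands for whatever LID match distance is in use, and all names and default values are hypothetical.

```python
import math
import random


def anneal(distance_fn, start, t_start=1.0, t_end=1e-3, cooling=0.95, rng=None):
    """Minimize distance_fn over a list of search variables.

    Returns (best_variables, best_distance).
    """
    rng = rng or random.Random(0)
    parent, parent_d = list(start), distance_fn(start)
    best, best_d = list(parent), parent_d
    t = t_start
    while t > t_end:
        # Perturbation magnitude shrinks as the temperature drops.
        child = [v + rng.gauss(0.0, t) for v in parent]
        child_d = distance_fn(child)
        delta = child_d - parent_d
        # Better children are always kept; worse ones are kept with a
        # probability that rises with temperature and falls with distance.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            parent, parent_d = child, child_d
            if parent_d < best_d:
                best, best_d = list(parent), parent_d
        t *= cooling
    return best, best_d
```

In the LID setting the search variables would be the position, radius, and stretch values listed below in the text, and `distance_fn` would build and compare the corresponding input and database LIDs.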

When LID Formulation 1 is used, the rotation angle of the LID need not necessarily be a simulated annealing search parameter. Faster convergence can be obtained by performing a simple step-wise search on rotation to find the best orientation (within the tolerance of the step size) within each simulated annealing time step.

The search variables, in both the input and database images, are:

LID x-position;

LID y-position;

LID radius;

LID x-stretch;

LID y-stretch; and

LID orientation angle (only for LID Formulation 1).

LID x-stretch and LID y-stretch are measures of “stretch” distortion applied to the LID circle, and measure the distortion of the circle into an oval. This is included to provide robustness to differences in orientation and curvature between the input and database images.

The use of multiple simultaneous LIDs provides additional robustness to occlusions, shadows, reflections, rotations, deformations, and object breaking. The best matches for multiple input image LIDs are sought throughout the database images. The input image LIDs are restricted to remain at certain minimum separation distances from each other. The minimum distance between any 2 LID centers is a function of the LID radii. The input image LIDs converge and settle on the regions of the input image having the best correspondence to any regions of any database images. Thus the LIDs behave in the manner of marbles rolling towards the lowest spot on a surface, e.g., the bottom of a bowl, but being held apart by their radius (although LIDs generally have minimum separation distances that are less than their radii).

In cases where the object in the input image appears deformed or curved relative to the known configuration in which it appears in the database, multiple input image LIDs will match to different database images. Each input image LID will match to that database image which shows the underlying portion of the object as it most closely resembles the input image. If the input image object is bent, e.g., a curved poster, then one part will match to one database orientation and another part will match to a different orientation.

In the case where the input image object appears to be broken into multiple pieces, either due to occlusion or to physical breakage, use of multiple LIDs again provides robust matching: individual LIDs “settle” on portions of the input image object as they match to corresponding portions of the object in various views in the database.

Robustness with respect to shadows and reflections is provided by LIDs simply not detecting good matches on these input image regions. They are in effect accommodated in the same manner as occlusions.

Robustness with respect to curvature and bending is accommodated by multiple techniques. First, use of multiple LIDs provides such robustness as described above. Secondly, curvature and bending robustness is inherently provided to some extent within each LID by use of LID range bin sizes that increase with distance from the LID center (e.g., logarithmic spacing). Given matching points in an input image and database image, deformation of the input image object away from the plane tangent at the matching point increases with distance from the matching point. The larger bin sizes of the outer bins (in both range and angle) reduce this sensitivity because they are less sensitive to image shifts.

Robustness with respect to lighting color temperature variations is provided by normalization of each color channel within each LID.

Fast performance, particularly with large databases, can be obtained through several techniques, as follows:

1. Use of LID Formulation 2 can reduce the amount of search by virtue of being rotationally invariant, although this comes at the cost of some robustness due to loss of image information.

2. If a metric distance (e.g., L1, L2, or Unfolded) is used for LID comparison, then database clustering, based on the triangle inequality, can be used to rule out large portions of the database from searching. Since database LIDs are created during the execution of the algorithm, the run-time database LIDs are not clustered. Rather, during preparation of the database, sample LIDs are created from the database images by sampling the search parameters throughout their valid ranges. From this data, bounding clusters can be created for each image and for portions of images. With this information the search algorithm can rule out portions of the search parameter space.

3. If a metric distance is used, then progressive multiresolution search can be used. This technique saves time by comparing data first at low resolution and proceeding with successively higher-resolution comparisons only on candidates with correlations better than the current best match. A discussion of this technique, along with database clustering, can be found in “Fast Exhaustive Multi-Resolution Search Algorithm Based on Clustering for Efficient Image Retrieval,” by Song et al, 2000.

4. The parameter search space and number of LIDs can be limited. Bounds can be placed, for example, on the sizes of LIDs depending on the expected sizes of input image objects relative to those in the database. A small number of LIDs, even 1, can be used, at the expense of some robustness.

5. LIDs can be fixed in the database images. This eliminates iterative searching on database LID parameters, at the expense of some robustness.

6. The “x-stretch” and “y-stretch” search parameters can be eliminated, although there is a trade-off between these search parameters and the number of database images. These parameters increase the ability to match between images of the same object in different orientations. Elimination of these parameters may require more database images with closer angular spacing, depending on the particular embodiment.

7. Parallel processing can be utilized to increase computing power.
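The progressive multiresolution search of item 3 can be illustrated with a minimal sketch. It is not the algorithm of Song et al.; it assumes one-dimensional descriptors, the L1 distance, and coarsening by summation, and the names `pool_sum`, `l1`, and `multires_search` are hypothetical. The sketch relies on the fact that the L1 distance between sum-pooled vectors never exceeds the full-resolution L1 distance, so a candidate whose coarse distance already reaches the current best match can be discarded without a full comparison.

```python
def pool_sum(v, factor):
    """Coarsen a 1-D descriptor by summing non-overlapping groups of samples."""
    return [sum(v[i:i + factor]) for i in range(0, len(v), factor)]

def l1(a, b):
    """L1 (city-block) distance, one of the metric distances named above."""
    return sum(abs(x - y) for x, y in zip(a, b))

def multires_search(query, database, factors=(8, 2)):
    """Return (best_index, best_distance, number_of_full_comparisons).

    Assumes every descriptor length is divisible by each pooling factor.
    """
    best_idx, best_dist, full_comparisons = -1, float("inf"), 0
    for idx, cand in enumerate(database):
        # Coarse-to-fine screening: the pooled L1 distance is a lower bound
        # on the full-resolution L1 distance, so a coarse distance that
        # already reaches best_dist rules the candidate out.
        if any(l1(pool_sum(query, f), pool_sum(cand, f)) >= best_dist
               for f in factors):
            continue
        full_comparisons += 1
        d = l1(query, cand)
        if d < best_dist:
            best_idx, best_dist = idx, d
    return best_idx, best_dist, full_comparisons
```

In a practical system the pooled versions of the database descriptors would presumably be precomputed rather than recomputed for each query, as is done here for brevity.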

This technique is similar to that described by Betke & Makris in “Fast Object Recognition in Noisy Images Using Simulated Annealing”, 1994, with the following important distinctions:

The current algorithm is robust with respect to occlusion. This is made possible by varying size and position of LIDs in database images, during the search process, in order to match non-occluded portions of database images.

The current algorithm can identify 3-dimensional objects by storing views of objects from many orientations in the database.

The current algorithm uses database clustering to enable rapid searching of large databases.

The current algorithm uses circular LIDs.
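The database clustering referred to above can be sketched as follows. This is a simplified illustration under stated assumptions, not the disclosed implementation: sample LIDs are treated as plain coordinate tuples, the L2 distance is used as the metric, each cluster is bounded by a centroid and a radius, and the names `build_clusters` and `clustered_search` are hypothetical. By the triangle inequality, no member of a cluster can be closer to the query than the query's distance to the cluster center minus the cluster radius, so a cluster whose bound cannot beat the current best match is skipped as a whole.

```python
import math

def build_clusters(lids, labels):
    """Group sample LIDs by label; keep each group's centroid and bounding radius."""
    clusters = []
    for label in sorted(set(labels)):
        members = [i for i, lab in enumerate(labels) if lab == label]
        dims = len(lids[members[0]])
        center = [sum(lids[i][k] for i in members) / len(members)
                  for k in range(dims)]
        radius = max(math.dist(lids[i], center) for i in members)
        clusters.append((center, radius, members))
    return clusters

def clustered_search(query, lids, clusters):
    """Return (best_index, best_distance, number_of_LIDs_skipped)."""
    best_idx, best_dist, skipped = -1, float("inf"), 0
    for center, radius, members in clusters:
        # Triangle inequality: d(query, m) >= d(query, center) - radius
        # for every member m, so the whole cluster can be ruled out.
        if math.dist(query, center) - radius >= best_dist:
            skipped += len(members)
            continue
        for i in members:
            d = math.dist(query, lids[i])
            if d < best_dist:
                best_idx, best_dist = i, d
    return best_idx, best_dist, skipped
```

The labels here stand in for whatever grouping the database preparation step produces, for example one cluster per database image or per portion of an image.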

In addition to containing image information, the database 108 also contains address information. After the target object 100 has been identified, the database 108 is searched to find information corresponding to the target object 100. This information can be an information address, such as an Internet URL. The identification server 106 then sends this information, in the form of the target object information 109, to the terminal 102. Depending on the particular embodiment of the invention, the target object information 109 may include, but not be limited to, one or more of the following items of information pertaining to the target object 100:

Information address (e.g., Internet URL);

Identity (e.g., object name, number, classification, etc.);

Position;

Orientation;

Size;

Color;

Status;

Information decoded from and/or referenced by symbols (e.g., information coded in a barcode or a URL referenced by such a barcode); and

Other data (e.g. alphanumerical text).

Thus, the identification server 106 determines the identity and/or various attributes of the target object 100 from the image data 105.
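The items listed above can be pictured as a single record returned by the identification server. The following is a minimal sketch, assuming a Python data class with optional fields; the class name `TargetObjectInfo`, the field names and types, the `lookup_target_info` helper, and the example address are all hypothetical and are not part of the disclosure.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple

@dataclass
class TargetObjectInfo:
    """Hypothetical container for the target object information 109."""
    information_address: Optional[str] = None        # e.g., an Internet URL
    identity: Optional[str] = None                   # name, number, classification
    position: Optional[Tuple[float, float]] = None   # position in the image
    orientation_deg: Optional[float] = None
    size: Optional[float] = None
    color: Optional[str] = None
    status: Optional[str] = None
    decoded_symbols: list = field(default_factory=list)  # e.g., barcode contents
    other_data: dict = field(default_factory=dict)       # e.g., alphanumeric text

def lookup_target_info(object_id, address_table):
    """Return the stored record for an identified object, or None if absent."""
    record = address_table.get(object_id)
    if record is None:
        return None
    return TargetObjectInfo(**record)
```

Every field is optional because, as stated above, a given embodiment may return any one or more of the listed items.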

The target object information 109 is sent to the terminal 102. This information usually flows via the same communication path used to send the image data 105 from the terminal 102 to the identification server 106, but this is not necessarily the case. The method of this information flow depends on the particular embodiment of the invention.

The terminal 102 receives the target object information 109. The terminal 102 then performs some action or actions based on the target object information 109. This action or actions may include, but not be limited to:

Accessing a web site.

Accessing or initiating a software process on the terminal 102.

Accessing or initiating a software process on another computer via a network or networks such as the Internet.

Accessing a web service (a software service accessed via the Internet).

Initiating a telephone call (if the terminal 102 includes such capability) to a telephone number that may be included in or determined by the target object information 109, may be stored in the terminal 102, or may be entered by the user.

Initiating a radio communication (if the terminal 102 includes such capability) using a radio frequency that may be included in or determined by the target object information 109, may be stored in the terminal 102, or may be entered by the user.

Sending information that is included in the target object information 109 to a web site, a software process (on another computer or on the terminal 102), or a hardware component.

Displaying information, via the screen or other visual indication, such as text, graphics, animations, video, or indicator lights.

Producing an audio signal or sound, including playing music.
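How a terminal might choose among the actions listed above can be sketched as a simple dispatch on the fields present in the target object information. This is only an illustration under stated assumptions: the information is modeled as a dictionary, the key names (`information_address`, `telephone_number`, `identity`), the action labels, and the function `plan_actions` are hypothetical, and the function only returns a plan rather than opening a connection or placing a call.

```python
def plan_actions(info, can_call=True):
    """Return a list of (action, argument) pairs for the terminal to perform.

    `info` is a dict standing in for the target object information;
    `can_call` reflects whether the terminal includes telephone capability.
    """
    actions = []
    if info.get("information_address"):
        # Access the web site or web service at the information address.
        actions.append(("open_url", info["information_address"]))
    if can_call and info.get("telephone_number"):
        # Initiate a telephone call only if the terminal has that capability.
        actions.append(("dial", info["telephone_number"]))
    if info.get("identity"):
        # Display the identity of the target object on the screen.
        actions.append(("display", info["identity"]))
    return actions
```

For example, `plan_actions({"identity": "Example CD"})` yields a single display action, since no address or telephone number is present.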

In many embodiments, the terminal 102 sends the target object information 109 to the browser 110. The browser 110 may or may not exist in the terminal 102, depending on the particular embodiment of the invention. The browser 110 is a software component, hardware component, or both, that is capable of communicating with and accessing information from a computer at an information address contained in target object information 109.

In most embodiments the browser 110 will be a web browser, embedded in the terminal 102, capable of accessing and communicating with web sites via a network or networks such as the Internet. In some embodiments, however, such as those that only involve displaying the identity, position, orientation, or status of the target object 100, the browser 110 may be a software component or application that displays or provides the target object information 109 to a human user or to another software component or application.

In embodiments wherein the browser 110 is a web browser, the browser 110 connects to the content server 111 located at the information address (typically an Internet URL) included in the target object information 109. This connection is effected by the terminal 102 and the browser 110 acting in concert. The content server 111 is an information server and computing system. The connection and information exchange between the terminal 102 and the content server 111 generally are accomplished via standard Internet and wireless network software, protocols (e.g., HTTP, WAP, etc.), and networks, although any information exchange technique can be used. The physical network connection depends on the system architecture of the particular embodiment but in most embodiments will involve a wireless network and the Internet. This physical network will most likely be the same network used to connect the terminal 102 and the identification server 106.

The content server 111 sends content information to the terminal 102 and browser 110. This content information usually is pertinent to the target object 100 and can be text, audio, video, graphics, or information in any form that is usable by the browser 110 and terminal 102. The terminal 102 and browser 110 send, in some embodiments, additional information to the content server 111. This additional information can be information such as the identity of the user of the terminal 102 or the location of the user of the terminal 102 (as determined from a GPS system or a radio-frequency ranging system). In some embodiments such information is provided to the content server by the wireless network carrier.

The user can perform ongoing interactions with the content server 111. For example, depending on the embodiment of the invention and the applications, the user can:

Listen to streaming audio samples if the target object 100 is an audio recording (e.g., compact audio disc).

Purchase the target object 100 via on-line transaction, with the purchase amount billed to an account linked to the terminal 102, to the individual user, to a bank account, or to a credit card.

In some embodiments the content server 111 may reside within the terminal 102. In such embodiments, the communication between the terminal 102 and the content server 111 does not occur via a network but rather occurs within the terminal 102.

In embodiments wherein the target object 100 includes or is a device capable of communicating with other devices or computers via a network or networks such as the Internet, and wherein the target object information 109 includes adequate identification (such as a sign, number, or barcode) of the specific target object 100, the content server 111 connects to and exchanges information with the target object 100 via a network or networks such as the Internet. In this type of embodiment, the terminal 102 is connected to the content server 111 and the content server 111 is connected to the target object 100. Thus, the terminal 102 and target object 100 can communicate via the content server 111. This enables the user to interact with the target object 100 despite the lack of a direct connection between the target object 100 and the terminal 102.

The following are examples of embodiments of the invention.

FIG. 5 shows a preferred embodiment of the invention that uses a cellular telephone, PDA, or such mobile device equipped with computational capability, a digital camera, and a wireless network connection, as the terminal 202 corresponding to the terminal 102 in FIG. 4. In this embodiment, the terminal 202 communicates with the identification server 206 and the content server 211 via networks such as a cellular telephone network and the Internet.

This embodiment can be used for applications such as the following (“User” refers to the person operating the terminal 202, the terminal 202 is a cellular telephone, PDA, or similar device, and “point and click” refers to the operation of the User capturing imagery of the target object 200 and initiating the transfer of the image data 205 to the identification server 206):

The User “points and clicks” the terminal 202 at a compact disc (CD) containing recorded music or a digital video disc (DVD) containing recorded video. The terminal 202 browser connects to the URL corresponding to the CD or DVD and displays a menu of options from which the user can select. From this menu, the user can listen to streaming audio samples of the CD or streaming video samples of the DVD, or can purchase the CD or DVD.

The User “points and clicks” the terminal 202 at a print media advertisement, poster, or billboard advertising a movie, music recording, video, or other entertainment. The browser 210 connects to the URL corresponding to the advertised item and the user can listen to streaming audio samples, view streaming video samples, obtain show times, or purchase the item or tickets.

The User “points and clicks” the terminal 202 at a television screen to interact with television programming in real-time. For example, the programming could consist of a product promotion involving a reduced price during a limited time. Users that “point and click” on this television programming during the promotion are linked to a web site at which they can purchase the product at the promotional price. Another example is interactive television programming in which users “point and click” on the television screen at specific times, based on the on-screen content, to register votes, indicate actions, or connect to a web site through which they perform real time interactions with the on-screen program.

The User “points and clicks” on an object such as a consumer product, an advertisement for a product, a poster, etc. The terminal 202 then makes a telephone call to the company selling the product, and the consumer has a direct discussion with a company representative regarding the company's product or service. In this case the company telephone number is included in the target object information 209. If the target object information 209 also includes the company URL then the User can interact with the company via both voice and Internet (via browser 210) simultaneously.

The User “points and clicks” on a vending machine (target object 200) that is equipped with a connection to a network such as the Internet and that has a unique identifying mark, such as a number. The terminal 202 connects to the content server 211 of the company that operates the vending machine. The identification server 206 identifies the particular vending machine by identifying and decoding the unique identifying mark. The identity of the particular machine is included in the target object information 209 and is sent from the terminal 202 to the content server 211. The content server 211, having the identification of the particular vending machine (target object 200), initiates communication with the vending machine. The User performs a transaction with the vending machine, such as purchasing a product, using his terminal 202 that communicates with the vending machine via the content server 211.

The User “points and clicks” on part of a machine, such as an aircraft part. The terminal 202 then displays information pertinent to the part, such as maintenance instructions or repair history.

The User “points and clicks” on a magazine or newspaper article and links to streaming audio or video content, further information, etc.

The User “points and clicks” on an automobile. The location of the terminal 202 is determined by a Global Positioning System receiver in the terminal 202, by cellular network radio ranging, or by another technique. The position of the terminal 202 is sent to the content server 211. The content server 211 provides the User with information regarding the automobile, such as price and features, and furthermore, based on the position information, provides the User with the location of a nearby automobile dealer that sells the car. This same technique can be used to direct Users to nearby retail stores selling items appearing in magazine advertisements that Users “point and click” on.

For visually impaired people:

Click on any item in a store and the device speaks the name of the item and price to you (the items must be in the database).

Click on a newspaper or magazine article and the device reads the article to you.

Click on a sign (building, street sign, etc.) and the device reads the sign to you and provides any additional pertinent information (the signs must be in the database).
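The location-based step in the automobile example above, in which the content server uses the position of the terminal to find a nearby dealer, can be sketched as a nearest-neighbor lookup over great-circle distances. This is an illustration only: the haversine formula and the 6371 km mean Earth radius are standard, but the function names, the dealer records, and their `lat`/`lon` keys are hypothetical and are not taken from the disclosure.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

def nearest_dealer(terminal_pos, dealers):
    """Return the dealer record closest to the terminal's (lat, lon) position."""
    lat, lon = terminal_pos
    return min(dealers, key=lambda d: haversine_km(lat, lon, d["lat"], d["lon"]))
```

The same lookup would serve the retail-store variant mentioned above, with store records in place of dealer records.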

FIG. 6 shows an embodiment of the invention for spacecraft applications. In this embodiment, all components of the system (except the target object 300) are onboard a Spacecraft. The target object 300 is another spacecraft or object. This embodiment is used to determine the position and orientation of the target object 300 relative to the Spacecraft so that this information can be used in navigating, guiding, and maneuvering the spacecraft relative to the target object 300. An example use of this embodiment would be in autonomous spacecraft rendezvous and docking.

This embodiment determines the position and orientation of the target object 300, relative to the Spacecraft, from the position, orientation, and size of the target object 300 in the imagery captured by the camera 303, by comparing the imagery with views of the target object 300 from different orientations that are stored in the database 308. The relative position and orientation of the target object 300 are output in the target object information, so that the spacecraft data system 310 can use this information in planning trajectories and maneuvers.
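One way the size and position of the target object in the imagery could be converted to a relative position is shown in the following minimal sketch. It assumes an ideal pinhole camera with a known focal length expressed in pixels and a target of known physical size, neither of which is specified in this description; the function `relative_position` and its parameter names are hypothetical. Under those assumptions the range is inversely proportional to the apparent size, and the lateral offsets scale with the pixel offset from the image center.

```python
def relative_position(center_px, size_px, image_center_px, focal_px, true_size_m):
    """Estimate (x, y, range) of a target in meters in the camera frame.

    center_px       -- (u, v) pixel location of the target in the image
    size_px         -- apparent size of the target in pixels
    image_center_px -- (u0, v0) pixel location of the optical axis
    focal_px        -- focal length in pixels (assumed known from calibration)
    true_size_m     -- physical size of the target in meters (assumed known)
    """
    # Pinhole model: size_px / focal_px = true_size_m / range_m.
    range_m = focal_px * true_size_m / size_px
    # Lateral offsets follow from similar triangles at that range.
    x_m = (center_px[0] - image_center_px[0]) * range_m / focal_px
    y_m = (center_px[1] - image_center_px[1]) * range_m / focal_px
    return x_m, y_m, range_m
```

Orientation is not estimated here; per the description above, it would come from which stored view of the target object best matches the imagery.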

INDUSTRIAL APPLICABILITY

The industrial applicability is anywhere that objects are to be identified by a digital optical representation of the object.

What is claimed is:
 1. A mobile printed media interaction apparatus comprising: a memory; a processor; and an image matching engine configured to execute on the processor according to software instruction stored in the memory, and that configures the processor to: receive a digital image of at least a portion of a printed media object, the digital image acquired by a mobile device; derive salient features from the digital image; recognize the printed media object as a target object based on the salient features; determine a position or an orientation of the printed media object with respect to at least a portion of a user within the digital image; identify content information corresponding to the printed media object based on recognizing the printed media object as the target object, the content information available at an address and stored in a content database; and cause a second device to perform an action associated with the content information based, at least in part, on the position or the orientation of the printed media object with respect to the at least the portion of the user within the digital image.
 2. The apparatus of claim 1, wherein the salient features comprise an image descriptor.
 3. The apparatus of claim 1, wherein the image descriptor comprises a scale-invariant feature.
 4. The apparatus of claim 1, wherein the digital image comprises a digital still image.
 5. The apparatus of claim 1, wherein the digital image comprises a digital video.
 6. The apparatus of claim 1, wherein the mobile device comprises at least one of the following: a robot, a cell phone, and a personal data assistant.
 7. The apparatus of claim 1, wherein the second device is the mobile device.
 8. The apparatus of claim 1, wherein the second device is at least one of the following: a robot, a cell phone, a vehicle, a vending machine, a personal data assistant, and a television.
 9. The apparatus of claim 1, wherein the image matching engine is further configured to receive a user voice command.
 10. The apparatus of claim 1, wherein the printed media object comprises at least one of the following: a passport, a driver's license, a newspaper, a magazine, an article, a billboard, a poster, a sign, an advertisement, a box, and a beverage can.
 11. The apparatus of claim 1, wherein the content information includes at least one of the following: an instruction, a repair history, music data, audio data, video data, graphics data, an animation, and text data.
 12. The apparatus of claim 1, wherein the action comprises an on-going interaction.
 13. The apparatus of claim 1, wherein the action comprises accessing a web site or web service over a network.
 14. The apparatus of claim 1, wherein the action comprises initiating a software process on the second device.
 15. The apparatus of claim 1, wherein the action comprises initiating a phone call.
 16. The apparatus of claim 1, wherein the action comprises initiating a radio communication.
 17. The apparatus of claim 1, wherein the action comprises reading a portion of the printed media object.
 18. The apparatus of claim 1, wherein the action comprises sending information related to the target object to a web site.
 19. The apparatus of claim 1, wherein the action comprises sending information related to the target object to a browser.
 20. The apparatus of claim 1, wherein the action comprises displaying information related to the printed media object on a display screen.
 21. The apparatus of claim 1, wherein the image matching engine is further configured to identify the content information based on a location.
 22. The apparatus of claim 1, wherein the image matching engine is further configured to identify the content information based on a time.
 23. A mobile device method of interacting with a printed media object comprising: receiving, by a processor, a digital image of at least a portion of a printed media object, the digital image acquired by a mobile device; deriving, by the processor, salient features from the digital image; recognizing, by the processor, the printed media object as a target object based on the salient features; determining, by the processor, a position or an orientation of the printed media object with respect to at least a portion of a user within the digital image; identifying, by the processor, content information corresponding to the printed media object based on recognizing the printed media object as the target object, the content information available at an address and stored in a content database; and causing, by the processor, a second device to perform an action associated with the content information based, at least in part, on the position or the orientation of the printed media object with respect to the at least the portion of the user within the digital image.
 24. The method of claim 23, wherein the salient features comprise an image descriptor.
 25. The method of claim 23, wherein the image descriptor comprises a scale-invariant feature.
 26. The method of claim 23, wherein the digital image comprises a digital still image.
 27. The method of claim 23, wherein the digital image comprises a digital video.
 28. The method of claim 23, wherein the mobile device comprises at least one of the following: a robot, a cell phone, and a personal data assistant.
 29. The method of claim 23, wherein the second device is the mobile device.
 30. The method of claim 23, wherein the second device is at least one of the following: a robot, a cell phone, a vehicle, a vending machine, a personal data assistant, and a television.
 31. The method of claim 23, further comprising receiving, by the processor, a user voice command.
 32. The method of claim 23, wherein the printed media object comprises at least one of the following: a passport, a driver's license, a newspaper, a magazine, an article, a billboard, a poster, a sign, an advertisement, a box, and a beverage can.
 33. The method of claim 23, wherein the content information includes at least one of the following: an instruction, a repair history, music data, audio data, video data, graphics data, an animation, and text data.
 34. The method of claim 23, wherein the action comprises an on-going interaction.
 35. The method of claim 23, wherein the action comprises accessing a web site or web service over a network.
 36. The method of claim 23, wherein the action comprises initiating a software process on the second device.
 37. The method of claim 23, wherein the action comprises initiating a phone call.
 38. The method of claim 23, wherein the action comprises initiating a radio communication.
 39. The method of claim 23, wherein the action comprises reading a portion of the printed media object.
 40. The method of claim 23, wherein the action comprises sending information related to the target object to a web site.
 41. The method of claim 23, wherein the action comprises sending information related to the target object to a browser.
 42. The method of claim 23, wherein the action comprises displaying information related to the printed media object on a display screen.
 43. The method of claim 23, further comprising identifying, by the processor, the content information based on a location.
 44. The method of claim 23, further comprising identifying, by the processor, the content information based on a time.
 45. A non-transitory computer-readable medium for enabling a mobile device to interact with a printed media object, comprising instructions stored on the medium and that configure a mobile device processor to perform the steps of: receiving a digital image of at least a portion of a printed media object, the digital image acquired by a mobile device; deriving salient features from the digital image; recognizing the printed media object as a target object based on the salient features; determining a position or an orientation of the printed media object with respect to at least a portion of a user within the digital image; identifying content information corresponding to the printed media object based on recognizing the printed media object as the target object, the content information available at an address and in a content database; and causing a second device to perform an action associated with the content information based, at least in part, on the position or the orientation of the printed media object with respect to the at least the portion of the user within the digital image.
 46. The non-transitory computer-readable medium of claim 45, wherein the salient features comprise an image descriptor.
 47. The non-transitory computer-readable medium of claim 45, wherein the image descriptor comprises a scale-invariant feature.
 48. The non-transitory computer-readable medium of claim 45, wherein the digital image comprises a digital still image.
 49. The non-transitory computer-readable medium of claim 45, wherein the digital image comprises a digital video.
 50. The non-transitory computer-readable medium of claim 45, wherein the mobile device comprises at least one of the following: a robot, a cell phone, and a personal data assistant.
 51. The non-transitory computer-readable medium of claim 45, wherein the second device is the mobile device.
 52. The non-transitory computer-readable medium of claim 45, wherein the second device is at least one of the following: a robot, a cell phone, a vehicle, a vending machine, a personal data assistant, and a television.
 53. The non-transitory computer-readable medium of claim 45, wherein the instructions further configure the mobile device processor to perform the step of receiving a user voice command.
 54. The non-transitory computer-readable medium of claim 45, wherein the printed media object comprises at least one of the following: a passport, a driver's license, a newspaper, a magazine, an article, a billboard, a poster, a sign, an advertisement, a box, and a beverage can.
 55. The non-transitory computer-readable medium of claim 45, wherein the content information includes at least one of the following: an instruction, a repair history, music data, audio data, video data, graphics data, an animation, and text data.
 56. The non-transitory computer-readable medium of claim 45, wherein the action comprises an on-going interaction.
 57. The non-transitory computer-readable medium of claim 45, wherein the action comprises accessing a web site or web service over a network.
 58. The non-transitory computer-readable medium of claim 45, wherein the action comprises initiating a software process on the second device.
 59. The non-transitory computer-readable medium of claim 45, wherein the action comprises initiating a phone call.
 60. The non-transitory computer-readable medium of claim 45, wherein the action comprises initiating a radio communication.
 61. The non-transitory computer-readable medium of claim 45, wherein the action comprises reading a portion of the printed media object.
 62. The non-transitory computer-readable medium of claim 45, wherein the action comprises sending information related to the target object to a web site.
 63. The non-transitory computer-readable medium of claim 45, wherein the action comprises sending information related to the target object to a browser.
 64. The non-transitory computer-readable medium of claim 45, wherein the action comprises displaying information related to the printed media object on a display screen.
 65. The non-transitory computer-readable medium of claim 45, wherein the instructions further configure the mobile device processor to perform the step of identifying the content information based on a location.
 66. The non-transitory computer-readable medium of claim 45, wherein the instructions further configure the mobile device processor to perform the step of identifying the content information based on a time. 